Perceptual audio loss function for deep learning
Authors
Abstract
PESQ (Perceptual Evaluation of Speech Quality) [5] and POLQA (Perceptual Objective Listening Quality Assessment) [1] are standards comprising a test methodology for automated assessment of the voice quality of speech as experienced by human listeners. The predictions of these objective measures should come as close as possible to the subjective quality scores obtained in subjective listening tests; usually, a Mean Opinion Score (MOS) is predicted. WaveNet [6] is a deep neural network originally developed as a deep generative model of raw audio waveforms. The WaveNet architecture is based on dilated causal convolutions, which exhibit very large receptive fields. In this short paper we suggest using the WaveNet architecture, in particular its large receptive field, to mimic the PESQ algorithm. By doing so, we can use it as a differentiable loss function for speech enhancement.
1 Problem formulation and related work
In statistics, the Mean Squared Error (MSE) and Peak Signal to Noise Ratio (PSNR) of an estimator are widely used objective measures and are good distortion indicators (loss functions) between the estimator's output and the signal we want to estimate. These loss functions are used in many reconstruction tasks. However, PSNR and MSE do not correlate well with reliable subjective methods such as the Mean Opinion Score (MOS) obtained from expert listeners. A more suitable speech quality assessment can be achieved by using tests that aim for high correlation with MOS tests, such as PEAQ or POLQA. However, those algorithms are hard to represent as a differentiable function such as MSE; moreover, as opposed to MSE, which measures the average
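The dilated causal convolutions mentioned above are what give WaveNet its large receptive field: stacking layers with exponentially increasing dilation makes the receptive field grow linearly in the number of filter taps but exponentially in depth, while causality guarantees each output sample depends only on past inputs. A minimal NumPy sketch of this mechanism (an illustration of the general technique, not the paper's implementation; all names and the toy filter are hypothetical):

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """1-D dilated causal convolution: y[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so no future samples leak in
    return np.array([sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
                     for t in range(len(x))])

def receptive_field(kernel_size, dilations):
    # One input sample plus (k-1)*d extra history per layer.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# A WaveNet-style stack: dilation doubles at each layer.
dilations = [1, 2, 4, 8, 16]
kernel = np.array([0.5, 0.5])  # toy 2-tap averaging filter

print(receptive_field(2, dilations))  # → 32 samples of context from just 5 layers
```

With ten such layers (dilations up to 512) the receptive field already exceeds 1000 samples, which is why this architecture can capture the long-range temporal context that a perceptual measure like PESQ needs.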
Similar Articles
Combining pattern recognition and deep-learning-based algorithms to automatically detect commercial quadcopters using audio signals (Research Article)
Commercial quadcopters, with many private, commercial, and public-sector applications, are a rapidly advancing technology. Currently, there is no mechanism to guarantee the safe operation of these devices in the community. Three different automatic commercial quadcopter identification methods are presented in this paper. Among these three techniques, two are based on deep neural networks, in whi...
L2 Learners’ Lexical Inferencing: Perceptual Learning Style Preferences, Strategy Use, Density of Text, and Parts of Speech as Possible Predictors
This study was intended first to categorize the L2 learners in terms of their learning style preferences and second to investigate if their learning preferences are related to lexical inferencing. Moreover, strategies used for lexical inferencing and text related issues of text density and parts of speech were studied to determine their moderating effects and the best predictors of lexical infe...
Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval
Deep cross-modal learning has successfully demonstrated excellent performances in cross-modal multimedia retrieval, with the aim of learning joint representations between different data modalities. Unfortunately, little research focuses on cross-modal correlation learning where temporal structures of different data modalities such as audio and lyrics are taken into account. Stemming from the ch...
Generating Images with Perceptual Similarity Metrics based on Deep Networks
Image-generating machine learning models are typically trained with loss functions based on distance in the image space. This often leads to over-smoothed results. We propose a class of loss functions, which we call deep perceptual similarity metrics (DeePSiM), that mitigate this problem. Instead of computing distances in the image space, we compute distances between image features extracted by...
Towards minimum perceptual error training for DNN-based speech synthesis
We propose to use a perceptually-oriented domain to improve the quality of text-to-speech generated by deep neural networks (DNNs). We train a DNN that predicts the parameters required for speech reconstruction but whose cost function is calculated in another domain. In this paper, to represent this perceptual domain we extract an approximated version of the SpectroTemporal Excitation Pattern t...
Journal:
- CoRR
Volume: abs/1708.05987
Publication year: 2017